arXiv:1502.03296v1 [physics.soc-ph] 11 Feb 2015


Statistical laws in linguistics 
Zipf’s law is just one out of many universal laws proposed to describe statistical regularities in
language. Here we review and critically discuss how these laws can be statistically interpreted,
fitted, and tested (falsified). The modern availability of large databases of written text allows for
tests with an unprecedented statistical accuracy and also a characterization of the fluctuations around
the typical behavior. We find that fluctuations are usually much larger than expected based on
simplifying statistical assumptions (e.g., independence and lack of correlations between observations).
These simplifications appear also in usual statistical tests so that the large fluctuations can be
erroneously interpreted as a falsification of the law. Instead, here we argue that linguistic laws are
only meaningful (falsifiable) if accompanied by a model for which the fluctuations can be computed
(e.g., a generative model of the text). The large fluctuations we report show that the constraints
imposed by linguistic laws on the creativity process of text generation are not as tight as one could
expect.
I. INTRODUCTION

“... ‘language in use’ cannot be studied without
statistics” Gustav Herdan (1964) [1]

In the past 100 years regularities in the frequency of
text constituents have been summarized in the form of
linguistic laws. For instance, Zipf’s law states that the
frequency $f$ of the $r$-th most frequent word in a text is
inversely proportional to its rank: $f \propto 1/r$ [2]. This
and other less famous linguistic laws are one of the main
objects of study of quantitative linguistics [SSj.

Linguistic laws have both theoretical and practical importance.
They provide insights on the mechanisms of
text (language, thought) production and are also crucial
in applications of statistical natural language processing
(e.g., information retrieval). Both the generative and
data-analysis views of linguistic laws are increasingly important
in modern applications. Data-mining algorithms
profit from accurate estimations of the vocabulary size
of a collection of texts (corpus), e.g., through Heaps’
law discussed in the next section. Methods for the automatic
generation of natural language can profit from
knowing the linguistic laws underlying usual texts. For
instance, linguistic laws may be included as (additional)
constraints in the space of possible (Markov generated)
texts [9] and can thus be considered as constraints on the
creativity of authors.


Besides giving an overview of various examples of linguistic
laws (Sec. II), in this paper we focus on their
probabilistic interpretation (Sec. III), we discuss different
statistical methods of data analysis (Sec. IV), and
the possibilities of connecting different laws (Sec. V). The
modern availability of large text databases allows for an
improved view on linguistic laws that requires a careful
discussion of their interpretation. Typically, more data
confirms the observations motivating the laws - mostly
based on visual inspection - but makes it increasingly
difficult for the laws to pass statistical tests designed to
evaluate their validity. This leads to a seemingly contradictory
situation: while the laws allow for an estimation
of the general behavior (e.g., they are much better than
alternative descriptions), they are, strictly speaking, falsified.
The aim of this contribution is to present this problem
and discuss alternative interpretations of the results.
We argue that the statistical analysis of linguistic laws
often shows long-range correlations and large (topical)
fluctuations. We conclude that null models accounting
for these observations are often ignored yet crucial in the
tests of the validity of linguistic laws.
FIG. 1: Examples of linguistic laws: (a) Zipf, (b) Menzerath-Altmann, and (c) Heaps laws. Data from one book (green, Moby
Dick by H. Melville) and for the English Wikipedia (red) are shown. Dotted (black) lines are the linguistic laws with arbitrary
parameters, chosen for visual comparison (see Appendix for details).


II. EXAMPLES AND OBSERVATIONS

An insightful introduction to linguistic laws is given
in Ref. [5] by Köhler, who distinguishes between three
kinds of laws as follows:

1. “The first kind takes the form of probability distributions,
i.e. it makes predictions about the number
of units of a given property.”

2. “The second kind of law is called the functional
type, because these laws link two (or more) variables,
i.e. properties.”

3. “The third kind of law is the developmental one.
Here, a property is related to time.” (time may be
measured in terms of text length)

We use the term linguistic law to denote quantitative
relationships between measurements obtained in a written
text or corpus, in contrast to syntactic rules and to
phonetic and language-change laws (e.g., Grimm’s law).
We assume that the laws make statements about individual
texts (corpus) and are exact in an appropriate limit
(e.g., large corpus) [65]. Each law contains parameters
which we denote by Greek letters $\alpha$, $\beta$, $\gamma$, and often refer
to the frequency $f(q)$ of a quantity $q$ in the text (with
$\sum_q f(q) = 1$). Probabilities are denoted by $P(q)$.

Next we discuss in detail one representative example
of each of the three types of laws mentioned above: Zipf,
Menzerath-Altmann, and Heaps laws, respectively, see
Fig. 1.

A. Zipf’s law

Zipf’s law is the best known linguistic law (see, e.g.,
Ref. m for historical references). In an early and simple
formulation, it states that if words (types) are ranked according
to their frequency of appearance $r = 1, 2, \ldots, V$,
the frequency $f(r)$ of the $r$-th word (type) scales with the
rank as

$f(r) = \frac{f(1)}{r}$, (1)

where $f(1)$ is the frequency of the most frequent word.
The above expression cannot hold for large $r$ because for
any $f(1) > 0$, there is an $r^*$ such that $\sum_{r=1}^{r^*} f(1)/r > 1$.
Taking also into account that $f(1)$ may not be the best
proportionality factor, a modern version of Zipf’s law is

$f(r) = \frac{\beta_Z}{r^{\alpha_Z}}$, (2)

with $\alpha_Z > 1$, see Fig. 1(a). The analogy with other
processes showing fat-tailed distributions motivates the
alternative formulation

$P(f) = \frac{\beta_Z^{\dagger}}{f^{\alpha_Z^{\dagger}}}$, (3)

where $P(f)$ is the fraction of the total number of words
(probability) that have frequency $f$. Formulations (2)
and (3) can be mapped to each other with $\alpha_Z^{\dagger} = 1 + 1/\alpha_Z$
HlIIllSS].

B. Menzerath-Altmann law

The Menzerath-Altmann law received considerable attention
after the works of Gabriel Altmann [na na.
Menzerath’s general (qualitative) statement originating
from his observations about phonemes is that “the
greater the whole the smaller its parts”. The quantitative
law intended to describe this observation is m

$y = \alpha_M x^{\beta_M} e^{-\gamma_M x}$, (4)

where $x$ measures the length of the whole and $y$ the (average)
size of the parts. One example [16] is obtained by computing
for each word $w$ the number of syllables $x_w$ and
the number of phonemes $z_w$. The length of the word (the
whole) is measured by the number of syllables $x_w$ while
TABLE I: List of linguistic laws.

Name of the law | Observables | Functional form | References
Zipf | $f$: freq. of word $w$; $r$: rank of $w$ in $f$ | $f(r) = \beta_Z r^{-\alpha_Z}$ | [ansHii]
Menzerath-Altmann | $x$: length of the whole; $y$: size of the parts | $y = \alpha_M x^{\beta_M} e^{-\gamma_M x}$ | HUE]
Heaps | $V$: number of words; $N$: database size | $V \sim N^{\alpha_H}$ | [DIMES]
Recurrence | $\tau$: distance between words | $P(\tau) \sim \exp(\alpha\tau)^{\beta}$ | EIEIES]
Long-range correlation | $C(\tau)$: autocorrelation at lag $\tau$ | $C(\tau) \sim \tau^{-\alpha}$ | ESJESj
Entropy Scaling | $H$: entropy of text with blocks of size $n$ | $H \sim \alpha n^{\beta} + \gamma n$ | ESIEq]
Information content | $I(l)$: information of word with length $l$ | $I(l) = \alpha + \beta l$ | ElED
Taylor’s law | $\sigma$: standard deviation around the mean $\mu$ | $\sigma \sim \mu^{\alpha}$ | m
Networks | Topology of lexical/semantic networks | various | [32–35]


the length of the parts is measured for each word as the
average number of phonemes per syllable $y_w = z_w/x_w$.
The comparison to the law is made by averaging $y_w$ over
all words $w$ with $x_w = x$, see Fig. 1(b). The ideas of the
Menzerath-Altmann law and Eq. (4) have been extended
and applied to a variety of problems, see Ref. m and
references therein.


C. Heaps’ law 

Heaps’ law states that the number of different words $V$
(i.e., word types) scales with database size $N$ measured
in the total number of words (i.e., word tokens) as pIfTS)

$V \sim N^{\alpha_H}$. (5)

In Fig. 1(c) this relationship is shown in two different
representations. For a single book, the value of $N$ is
increased from the first word (token) until the end of
the book so that $V(N)$ draws a curve. For the English
Wikipedia, each article is considered as a separate document
for which $V$ and $N$ are computed and shown as
dots.
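
The observables entering Zipf’s and Heaps’ laws can be computed directly from a tokenized text. The following minimal Python sketch is an illustration only: the toy token list and the function names `rank_frequency` and `vocabulary_growth` are assumptions made here, and the tokenization (which strongly affects real results) is taken as given.

```python
from collections import Counter


def rank_frequency(tokens):
    """Relative frequencies f(r), ordered from the most to the least frequent type."""
    counts = Counter(tokens)
    total = len(tokens)
    return [count / total for _, count in counts.most_common()]


def vocabulary_growth(tokens):
    """V(N) for N = 1, ..., len(tokens): number of distinct types among the first N tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy example (hypothetical text, for illustration only).
tokens = "the whale and the sea and the ship".split()
f = rank_frequency(tokens)      # rank-frequency representation used in Zipf's law
V = vocabulary_growth(tokens)   # vocabulary growth curve used in Heaps' law
```

Plotting `f` against the rank and `V` against the number of tokens on logarithmic axes gives the single-book representations described for Fig. 1(a) and (c).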

The non-trivial regularities and the similarity between
the two disparate databases found for the three cases analyzed
in Fig. 1 strongly suggest that the three linguistic
laws summarized above capture important properties of
the structure of texts. Additional examples of linguistic
laws are listed in Tab. I, see also the vast literature in
quantitative linguistics [3–6]. The (qualitative) observations
reported above motivate us to search for quantitative
analyses that match the requirements of applications
and the accuracy made possible through the use of large
corpora. The natural questions that we would like to
address here are: Are these laws true (compatible with
the observations)? How should their parameters be determined?
How large are the fluctuations around them that should be expected
(allowed)? Are these laws related to each other? Before
addressing these questions we discuss how one should
interpret linguistic laws.


III. INTERPRETATION OF LINGUISTIC LAWS 

In Chap. 26 Text Laws of Ref. [3], Hřebíček argues that

“...the notion law (in the narrower sense
scientific law) in linguistics and especially in
quantitative linguistics ... need not obtain
some special comprehension different from its
validity in other sciences. Probably, the best
delimitation of this concept can be found
in the works by the philosopher of scientific
knowledge Karl Raimund Popper...”

This view is also emphasized by Köhler in Ref. [5], who
distinguishes laws from rules and states that a “significant
difference is that rules can be violated - laws (in the
scientific sense) cannot.”

Such a straightforward identification between linguistic
and scientific laws masks the central role played by
statistics (and probability theory) in the interpretation
of linguistic laws. To see this, first notice that these
laws do not directly affect the production of (grammatically
and semantically) meaningful sentences because
they typically involve scales much larger or shorter than
a sentence. It is thus not difficult to be convinced that a
creative and persistent daemon [66], trained in the techniques
of constrained writing [37], can generate understandable
and arbitrarily long texts which deliberately violate
any single law mentioned above. In a strict Popperian
sense, a single such demonic text would be sufficient
to falsify the proposed laws. Linguistic laws are
thus different from syntactic rules and require a different
interpretation than, e.g., the laws of classical Physics.

The central role of statistics in Quantitative Linguistics 
was emphasized by its founding father Gustav Herdan: 

“The distinction between language laws in 
the conventional sense and statistical laws of 
language corresponds closely to that between 
the classical laws of nature, or the physical 
universe, and the statistical laws of modern 
physics.” [1]

Altmann, when discussing Menzerath law m, also emphasizes
that “this law is a stochastic one”, and Köhler [3]
refers to the concept of stochastic hypothesis. There are
at least two instances in which a statistical interpretation
should be included:

1. In the statement of the law, e.g., in Zipf’s law the
probability of finding a word with frequency $f$ decays
as $P(f) \sim f^{-\alpha_Z^{\dagger}}$.

2. In the interpretation of the law as being typical in
a collection of texts, e.g., in Heaps’ law the vocabulary
$V$ of a (typical) text of size $N$ is $V \sim N^{\alpha_H}$.

The demonic texts mentioned above would be considered
untypical (or highly unlikely). Statistical laws in at least
one of these senses are characteristic not only of modern
Physics, as pointed out by Herdan, but also of different
areas of natural and social sciences: Benford’s law predicts
the frequency of the first digit of numbers appearing
in a corpus and the Gutenberg-Richter law determines
the frequency of earthquakes of a given magnitude. The
analysis of these laws, including possible refutations, has
to be done through rigorous statistical methods, the subject
of the next section. Important aspects of linguistic
laws not discussed in detail in this Chapter include: (i)
the universality and variability of parameters of linguistic
laws (e.g., across different languages [2Ql 1^ l35l [38l [40),
as a function of size [39] and degree of mixture of the
corpus im, styles m, and age of speakers my), and (ii)
the relevance and origins of the laws. This second point
was intensively debated for Zipf’s law HIHiig, with
quantitative approaches based on stochastic processes -
e.g., the Monkey typewriter model iia), rich-get-richer
mechanisms [TOHii na HB - and on optimization principles
- e.g., between speaker and hearer miiiiisi or
entropy maximization [MlllTj.

IV. STATISTICAL ANALYSIS 

In Sec. II we argued in favor of linguistic laws by showing
a graphical representation of the data (Fig. 1). The
widespread availability of large databases and the applications
of linguistic laws require and allow for a more
rigorous statistical analysis of the results. To this end we
assume the linguistic law can be translated into a precise
mathematical statement about a curve or distribution.
This distribution has a set of parameters and observations.
Legitimate questions to be addressed are:

(1) Fitting. What are the best parameters of the law
to describe a given data set?

(2) Model Comparison. Is the law better than an
alternative one?

(3) Validity. Is the law compatible with the observations?

These points are representative of statistical analysis performed
more generally and should precede any more fundamental
discussion on the origin and importance of a
specific law. Below we discuss in more detail how each
of the three points listed above has been and can be addressed
in the case of linguistic laws.

A. Graphical approaches 

Visual inspection and graphical approaches were the
first type of analysis of linguistic laws and are still widely
used. One simple and still very popular fitting approach
is least squares (minimizing the squared distance between
data and model). Often this is done in combination with
a transformation of variables that maps the law into a
straight line (e.g., using logarithmic scales on the axes or
taking the logarithm of the independent and dependent
variables in Zipf’s and Heaps’ laws). These transformations
are important to visually detect patterns and are
a valuable part of any data analysis. However, they are not
appropriate for a quantitative analysis of the data. The
problem of fitting straight lines in log-log scale is that
least-squares fitting assumes an uncertainty (fluctuation)
on each point that is independent, Gaussian distributed,
and equal in size for all fitted points. These assumptions
are usually not justified (see, e.g., Refs. [48, 49]
for the case of fitting power-law distributions), while at
the same time the uncertainties are modified through the
transformation of variables (such as using the log scale).
Furthermore, quantifying the goodness-of-fit by using the
correlation coefficient in these scales is insufficient to
evaluate the validity of a given law. A high quality of the
fit indicates a high correlation between data and model,
but is unable to assign a probability for observations and
thus it is not suited for a rigorous test of the law.
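
To make the graphical procedure concrete, the following minimal sketch fits a straight line to $\log f$ versus $\log r$ by ordinary least squares and returns the correlation coefficient in these scales. The function name `loglog_least_squares` and the noise-free synthetic input are assumptions made here for illustration; the sketch shows the procedure whose limitations are discussed above and is not a recommended estimator.

```python
import numpy as np


def loglog_least_squares(freqs):
    """Least-squares fit of log f = log(beta) - alpha * log r.

    Every point in the log-log plot enters with the same weight, which is
    exactly the assumption criticized in the text.
    """
    f = np.sort(np.asarray(freqs, dtype=float))[::-1]   # rank by frequency
    r = np.arange(1, len(f) + 1, dtype=float)
    x, y = np.log(r), np.log(f)
    slope, intercept = np.polyfit(x, y, 1)
    corr = np.corrcoef(x, y)[0, 1]
    return -slope, np.exp(intercept), corr


# Noise-free synthetic power law (hypothetical parameters, illustration only).
ranks = np.arange(1, 1001, dtype=float)
exact = 0.1 * ranks ** -1.2
alpha, beta, corr = loglog_least_squares(exact)
```

On exact power-law input the fit recovers the parameters, but a correlation coefficient close to -1 on real data does not assign a probability to the observations and therefore cannot serve as a test.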

B. Likelihood methods 

A central quantity in the statistical analysis of data is
the likelihood $\mathcal{L}(x; \alpha)$ that the data $x$ was generated by
the model (with a set of parameters $\alpha$).

(1) Fitting. When fitting a model (law) to data the
approach is to tacitly assume its validity and then search
for the best parameters to account for the data. It corresponds
to a search in the (multidimensional) parameter
space $\alpha$ of the law for the value $\hat{\alpha}$ that maximizes $\mathcal{L}$.

In laws of the first kind - as listed in Sec. II - the
quantity to be estimated from data is a probability distribution
$P(x; \alpha)$. The probability of an observation $x_j$ is
thus given by $P(x_j; \alpha)$. Assuming that all $J$ observations
are independent, the best parameter estimates $\hat{\alpha}$ are the
values of $\alpha$ that maximize the log-likelihood

$\log_e \mathcal{L} = \log P(x_1, x_2, \ldots, x_J; \alpha) = \sum_{j=1}^{J} \log P(x_j; \alpha)$. (6)

The need for Maximum Likelihood (ML) methods when
fitting power-law distributions (such as Zipf’s law) has
been emphasized in many recent publications. We refer
to the review article Ref. [50] and references therein for
more details, and to Ref. [51] for fitting truncated distributions
(e.g., due to cut-offs).
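
As an illustration of Eq. (6), the following sketch maximizes the log-likelihood of a discrete power law with support on the positive integers, using the downhill simplex (Nelder-Mead) method available in scipy. The synthetic sample, the starting value, and the function names are assumptions made here for illustration; the sketch relies on independent observations, the very assumption questioned later in this section.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import zeta


def neg_log_likelihood(alpha, x):
    """Minus the log-likelihood of Eq. (6) for P(x) = x**(-alpha) / zeta(alpha), x = 1, 2, ..."""
    if alpha <= 1.0:
        return np.inf   # the distribution is not normalizable for alpha <= 1
    return alpha * np.sum(np.log(x)) + len(x) * np.log(zeta(alpha, 1))


def fit_discrete_power_law(x, alpha0=2.0):
    """Maximum-likelihood exponent found with the downhill simplex algorithm."""
    x = np.asarray(x, dtype=float)
    result = minimize(lambda a: neg_log_likelihood(a[0], x),
                      x0=[alpha0], method="Nelder-Mead")
    return result.x[0]


# Synthetic, independent draws from a discrete power law (illustration only).
rng = np.random.default_rng(0)
sample = rng.zipf(1.5, size=20000)
alpha_hat = fit_discrete_power_law(sample)
```

For independent draws the estimate is close to the exponent used to generate the sample; for real texts the same code can be run on ranks or on frequencies, with the caveats discussed in Sec. IV C.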

In laws of the second and third kind - as listed in
Sec. II - the quantity to be described, $y$, is a function
$y = y_g(x; \alpha)$. Fitting requires assumptions regarding the
possible fluctuations in $y(x)$. One possibility is to assume
Gaussian fluctuations with a standard deviation $\sigma(x)$.
In this case, assuming again that the observations $x$ are
independent [52],

$\log_e \mathcal{L} \sim -\sum_j \left( \frac{y(x_j) - y_g(x_j)}{\sigma(x_j)} \right)^2$, (7)

where the sum is over all observations $j$. The best estimated
parameters $\hat{\alpha}$ are obtained by minimizing
$\chi^2 = \sum_j \left( \frac{y(x_j) - y_g(x_j)}{\sigma(x_j)} \right)^2$, which maximizes (7). Least-squares
fitting is equivalent to Maximum-Likelihood fitting only
in the case of constant $\sigma$ (independent of $x$) [52].
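
A minimal sketch of the weighted fit implied by Eq. (7) is given below for the Menzerath-Altmann function of Eq. (4). The noise-free synthetic data, the assumed uncertainties `sigma`, the starting point, and the parameter values are hypothetical choices made here for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def menzerath_altmann(x, a, b, c):
    """Eq. (4): y = a * x**b * exp(-c * x)."""
    return a * x ** b * np.exp(-c * x)


def chi2(params, x, y, sigma):
    """Sum of squared residuals, each weighted by its own uncertainty sigma(x)."""
    return np.sum(((y - menzerath_altmann(x, *params)) / sigma) ** 2)


x = np.arange(1.0, 8.0)                      # e.g. number of syllables
true_params = (3.3, -0.12, -0.051)           # illustrative values only
sigma = 0.02 * np.sqrt(x)                    # assumed x-dependent uncertainty
y = menzerath_altmann(x, *true_params)       # noise-free synthetic data

start = [3.0, 0.0, 0.0]
res = minimize(chi2, x0=start, args=(x, y, sigma), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12,
                        "maxiter": 5000, "maxfev": 5000})
```

Because `sigma` depends on $x$, minimizing `chi2` is not the same as an unweighted least-squares fit; the two coincide only when `sigma` is constant.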

(2) Model Comparison. The comparison between two
different functional forms of the law (m1 and m2)
is done by comparing their likelihoods, e.g., through
the log-likelihood ratio $\log_e \mathcal{L}_{m1}/\mathcal{L}_{m2}$ [53]. A value
$\log_e \mathcal{L}_{m1}/\mathcal{L}_{m2} = 1$ $(-1)$ means it is $e = 2.718\ldots$ times
more (less) likely that the data was generated by function
m1 than by function m2. If the two models have
a different number of parameters, one can penalize the
model with the higher number of parameters using, e.g., the
Akaike information criterion [54], or calculate the Bayes
factor by averaging (in the space of parameters) over the
full posterior distribution [55].
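
The two quantities mentioned above are simple functions of the maximized log-likelihoods. The sketch below uses hypothetical log-likelihood values and assumed function names; the Akaike information criterion is written in its standard form, $2k - 2\log_e \mathcal{L}$ for a model with $k$ parameters.

```python
import math


def log_likelihood_ratio(loglik_m1, loglik_m2):
    """log_e(L_m1 / L_m2), computed from the two log-likelihoods."""
    return loglik_m1 - loglik_m2


def aic(loglik, n_params):
    """Akaike information criterion: lower values indicate the preferred model."""
    return 2 * n_params - 2 * loglik


# Hypothetical maximized log-likelihoods of two models (illustration only).
ratio = log_likelihood_ratio(-1200.0, -1201.0)
times_more_likely = math.exp(ratio)     # e = 2.718... when the ratio equals 1
aic_m1 = aic(-1200.0, 3)
aic_m2 = aic(-1201.0, 2)
```

In this toy example the likelihood ratio favors m1, while the AIC values of the two models are equal because m1 uses one additional parameter.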

(3) Validity. The probabilistic nature of linguistic
laws requires statistical tests. One possible approach is
to compute the probability (p-value) of having observations
similar to the data from a null model compatible
with the linguistic law (which is assumed to be true). A
low p-value is a strong indication that the null model is
violated and may be used to refute the law (e.g., if
p-value < 0.01). Defining a measure of distance $D$ between
the data and the model, the p-value can be computed as
the fraction of finite-size realizations of the model (assuming
it is true) that show a distance $D' > D$. In the
case of probability distributions - linguistic laws of the
first kind in Sec. II - the distance $D$ is usually taken
to be the Kolmogorov-Smirnov distance (the largest distance
between the empirical and fitted cumulative distributions).
In the case of simple functions - linguistic laws
of the second and third kind in Sec. II - one can consider
$D = \chi^2$.
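
The p-value computation with $D = \chi^2$ can be sketched as a Monte Carlo procedure: synthetic data are drawn from the model with Gaussian fluctuations, and the fraction of realizations with a distance larger than the observed one is reported. The model values, uncertainties, number of realizations, and function names below are assumptions made here for illustration; the sketch does not refit the parameters in each realization.

```python
import numpy as np


def chi2_distance(y, y_model, sigma):
    """Distance D = chi^2 between observations and model predictions."""
    return float(np.sum(((y - y_model) / sigma) ** 2))


def monte_carlo_p_value(y_obs, y_model, sigma, n_realizations=10000, seed=0):
    """Fraction of model realizations whose distance D' exceeds the observed D."""
    rng = np.random.default_rng(seed)
    d_obs = chi2_distance(y_obs, y_model, sigma)
    # One synthetic point per x, centered at the model with standard deviation sigma.
    synthetic = rng.normal(loc=y_model, scale=sigma,
                           size=(n_realizations, len(y_model)))
    d_syn = np.sum(((synthetic - y_model) / sigma) ** 2, axis=1)
    return float(np.mean(d_syn > d_obs))


# Hypothetical model predictions and uncertainties (illustration only).
y_model = np.array([3.4, 3.1, 2.9, 2.8, 2.7])
sigma = np.full(5, 0.05)
p_good = monte_carlo_p_value(y_model + 0.01, y_model, sigma)  # data close to the model
p_bad = monte_carlo_p_value(y_model + 0.5, y_model, sigma)    # data far from the model
```

The resolution of the p-value is limited by the number of realizations: with 10000 realizations, values below $10^{-4}$ cannot be distinguished from zero.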

Application: Menzerath-Altmann law. We applied
the likelihood analysis summarized above to the case of
the Menzerath-Altmann law introduced in Sec. II B. Our
critical assumption here is that the law is intended to
describe the average number of phonemes per syllable, $y$,
computed over many words $w$ with the same number of
syllables $x$. Assuming the words are independent of each
other, the uncertainty in $y(x)$ is thus the standard error of
the mean given by $\sigma_y(x) = \sigma_w(x)/\sqrt{N(x)}$, where $\sigma_w(x)$
is the (empirical) standard deviation over the words with
$x$ syllables and $N(x)$ is the number of such words.
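
The averaging and the standard error of the mean described above can be sketched as follows. The per-word values are hypothetical, the function name is an assumption, and the use of the sample standard deviation (`ddof=1`) is one possible reading of "empirical standard deviation".

```python
import numpy as np


def binned_mean_and_sem(x_w, y_w):
    """For each x, the mean y over words with x_w = x and its standard error."""
    x_w = np.asarray(x_w)
    y_w = np.asarray(y_w, dtype=float)
    result = {}
    for x in np.unique(x_w):
        values = y_w[x_w == x]
        if len(values) < 2:
            continue   # the sample standard deviation is undefined for one word
        sem = values.std(ddof=1) / np.sqrt(len(values))
        result[int(x)] = (values.mean(), sem)
    return result


# Hypothetical words: number of syllables and phonemes per syllable.
syllables = [1, 1, 1, 1, 2, 2, 2, 2]
phonemes_per_syllable = [3.0, 4.0, 3.0, 4.0, 2.0, 3.0, 2.0, 3.0]
stats = binned_mean_and_sem(syllables, phonemes_per_syllable)
```

The resulting standard errors play the role of $\sigma(x_j)$ in Eq. (7); they shrink as $1/\sqrt{N(x)}$ only if the words are independent.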

In Fig. 2 and Tab. II we report the fitting, model comparison,
and validity analysis for the Menzerath-Altmann
law - Eq. (4) - and three alternative functions with the
same number of parameters. The results show that two
of the three alternative functions (shifted power law and
stretched exponential) provide a better description than
the proposed law, which we can safely consider to be incompatible
with the data (p-value < 10“^). Considering
the two databases, the stretched exponential distribution
provides the best description and is not refuted. These
results depend strongly on the procedure used to identify
phonemes and syllables (see Appendix).


C. Critical discussion 

In the next paragraphs we critically discuss the likelihood
approach considering the example of Zipf’s law.

Fitting as model comparison. In the beginning of this
section we started with the distinction between fitting
(i.e., fixing free parameters) and model comparison (i.e.,
choosing between different models). This division is didactic
m, but from a formal point of view both procedures
correspond to hypothesis testing because the free
parameters of one fitting model can be thought of as a
continuous parameterization of different models which
should be compared and selected according to their likelihood
[56]. This means that the points mentioned below
apply equally well to both fitting and hypothesis testing
(and, in most cases, also to testing the validity of the
models).

Fitting ranks. Power-law fitting recipes [50] - employed
for linguistic [43] and non-linguistic problems -
suggest fitting Zipf’s law using the distribution of frequencies
$P(f)$ given in Eq. (3). However, it is also possible to
use the rank formulation (2) m because the frequency
of ranks $f(r)$ is normalized, $\sum_r f(r) = 1$, and can thus
be interpreted as a probability distribution. A
drawback in fitting $f(r)$ is that the process of ranking
introduces a bias in the estimator [57, 58]. For instance,
consider a finite sample from a true Zipf distribution containing
ranks $r = 1, \ldots, \infty$. Because of statistical fluctuations,
some of the rankings will be inverted (or absent)
so that when we rank the words according to the observations
we obtain ranks different from the ones drawn. This
effect introduces a bias in our estimation of the parameters
(overestimating the quality of the fit). The words affected
by this bias are the ones with the largest ranks, which contribute
very little to the estimation of the parameters of
Zipf’s law (as discussed below). Therefore, we expect
this bias to become negligible for sufficiently large
sample sizes.
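
The effect of ranking a finite sample can be illustrated with a short simulation. The sketch below draws tokens from a Zipf distribution truncated at a finite number of types (a simplification of the infinite support considered above), re-ranks the types by their observed counts, and reports how many types changed rank or are absent. All parameter values and the function name are assumptions made here for illustration.

```python
import numpy as np


def sample_and_rerank(alpha=1.2, n_types=1000, n_tokens=5000, seed=0):
    """Draw tokens from a truncated Zipf law and rank types by observed counts."""
    rng = np.random.default_rng(seed)
    true_ranks = np.arange(1, n_types + 1, dtype=float)
    p = true_ranks ** (-alpha)
    p /= p.sum()                                   # true rank distribution
    counts = rng.multinomial(n_tokens, p)          # observed count of each type
    order = np.argsort(-counts, kind="stable")     # empirical ranking
    observed_counts = counts[order]
    moved = int(np.sum(order != np.arange(n_types)))   # types whose rank changed
    absent = int(np.sum(counts == 0))                  # types never observed
    return observed_counts, moved, absent


observed_counts, moved, absent = sample_and_rerank()
```

By construction the empirical counts are non-increasing in the observed rank, even though the counts ordered by the true rank fluctuate; this is the origin of the bias described above.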

Representation matters. Equivalent formulations of
the linguistic laws lead to different statistical analyses and
conclusions [57, 58]. One example of this point is the use
of transformations before the fitting is performed, such as
FIG. 2: Model comparison for the Menzerath-Altmann law. Data points are the average over all words (types) in a book (Moby
Dick by H. Melville, as in Fig. 1). The curves show the best fits of the four alternative curves, as reported in Tab. II. Left plot:
the data in the original scales, as in Fig. 1(b). Right plot: the distance between the curves and the points $(y - \hat{y})/\sigma_y$, where the
uncertainty $\sigma_y$ is the standard error of the mean.


TABLE II: Likelihood analysis of the Menzerath-Altmann law and three alternative functions. The parameters $(\hat{\alpha}, \hat{\beta}, \hat{\gamma})$ that
maximize the likelihood $\mathcal{L}_m$ of model $m$ were computed using the downhill simplex algorithm (using the Python library scipy).
The reported p-value corresponds to the fraction of random realizations with a $\chi^2$ larger than the observed $\chi^2$. In each
realization, one point $y_m(x)$ was generated at each $x$ from a Gaussian distribution centered at the model prediction $y_m(x)$ with
a standard deviation $\sigma_y(x)$ given by the data. The best models and the results with p > 0.01 are shown in bold face.

 | Menzerath-Altmann (MA) | Shifted power law | Stretched exp. | Polynomial
Function | $\alpha x^{\beta} \exp(-\gamma x)$ | $\alpha (x + \beta)^{\gamma}$ | $\alpha \exp(\beta x)^{\gamma}$ | $\alpha + \beta x + \gamma x^2$

Results for one book (Moby Dick by H. Melville)
$(\hat{\alpha}, \hat{\beta}, \hat{\gamma})$ | (3.3, -0.12, -0.051) | (2.8, -0.65, -0.19) | (1.5, 1.4, -0.51) | (3.9, -0.69, 0.066)
$\log_e \mathcal{L}_i/\mathcal{L}_{MA}$ | 0 | 33 | 25 | -475
p-value | < lO"'^ | 0.611 | 0.064 | < 10“®

Results for English Wikipedia
$(\hat{\alpha}, \hat{\beta}, \hat{\gamma})$ | (3.2, -0.45, -0.064) | (2.8, -0.70, -0.18) | (1.6, 1.5, -0.60) | (3.8, -0.64, 0.061)
$\log_e \mathcal{L}_i/\mathcal{L}_{MA}$ | 0 | 11 | 49 | -1898
p-value | < 10"® | 2 X 10“® | 0.93 | < 10“®


the linear fit of Zipf’s law in logarithmic scale discussed
in Sec. IV A. The variables used to represent the linguistic
law are also crucial when likelihood methods are used,
as discussed above for the case of Zipf’s law represented
in $f(r)$ or $P(f)$. While asymptotically these formulations
are equivalent, the likelihood computed in both cases is
different. In the likelihood of $P(f)$, an observation corresponds
to the frequency of a word type. This means that
the most frequent words in the database count the same
as words appearing only once (the hapax legomena). In
practice, the part of the distribution that matters the
most in the fitting (and in the likelihood) are the words
with very few counts, which contribute very little to the
total text. In the likelihood of $f(r)$ the observational
quantity is the rank $r$ of each occurrence of the word,
meaning that each word token counts the same. This
means that the frequent words contribute more and the
fitting of $f(r)$ is robust against rare words. Linear regression
in a log-log plot counts every point in the plot
the same and, since there are more points for large $r$,
low-frequency words dominate the fit. Using logarithmic
binning, as suggested in Ref. [48], equalizes the importance
of words across $\log(r)$. In summary, while fitting
a straight line in log-log scale using logarithmic binning
gives the same weight to words across the full spectrum
(in a logarithmic scale), the statistically rigorous methods
of Maximum Likelihood will be dominated either by the
most frequent (in case of fitting in $f(r)$) or least frequent
(in case of fitting in $P(f)$) words.
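
The different weights given to rare words in the two representations can be seen by counting the hapax legomena once per type (as in the likelihood of $P(f)$) and once per token (as in the likelihood of $f(r)$). The toy text and the function name below are assumptions made here for illustration.

```python
from collections import Counter


def hapax_share(tokens):
    """Share of hapax legomena among word types and among word tokens."""
    counts = Counter(tokens)
    n_hapax = sum(1 for count in counts.values() if count == 1)
    share_of_types = n_hapax / len(counts)    # weight in a fit of P(f)
    share_of_tokens = n_hapax / len(tokens)   # weight in a fit of f(r)
    return share_of_types, share_of_tokens


# Toy text (hypothetical, for illustration only).
tokens = "the whale and the sea and the ship".split()
share_of_types, share_of_tokens = hapax_share(tokens)
```

In this toy text the hapax legomena make up 60% of the observations when each type counts once but only 37.5% when each token counts once; in large corpora the difference between the two weights is typically much more pronounced.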

Beyond Zipf’s law, the reasoning above shows that
even if asymptotically (i.e., for infinite data) different formulations
of a law are equivalent, the representation in
which we test the law matters because it assumes a sampling
process of the data. This in turn leads to different
results when applied to finite and often noisy data and
has to be taken into account when interpreting the results.


Application: Fitting Zipf’s law. In Fig. 3 and Tab. III
we compare the different fitting methods described
above. The visual agreement between data and the fitted
curves reflects the different weights given by the methods
to different regions of the distribution as discussed
above (high-frequency words for $f(r)$ and low-frequency
words for the other two cases). Not surprisingly, Tab. III
shows that the estimated exponent $\alpha$ varies from method
to method. This variation is larger than the variation
FIG. 3: Comparison of the Zipf’s law obtained using three different fitting methods. Results are shown for one book (Moby
Dick by H. Melville, top row) and for the complete English Wikipedia (bottom row). Data is fitted using Maximum Likelihood
(ML) in the frequency rank $f(r)$ (left), ML in the frequency distribution $P(f) \sim p(k)$ (center), and least squares (LS) in the $\log f$
vs. $\log r$ representation (right). Insets show the cumulative distributions. See Tab. III for the parameter $\alpha$ and significance
test of the fits. In the plot in the center, instead of $P(f)$ we use the distribution of the unnormalized frequency $p(k)$ (i.e., $k$ is the
number of occurrences of a word in the database). For ML fits, we used a discrete power law in $f(r)$ and $p(k)$ with support in
$[1, \infty)$ (exponents were obtained using the downhill simplex algorithm of the Python library scipy). For the LS fit, we used a
continuous straight line in $\log f(\log r)$ [52].


TABLE III: Zipf’s law exponent obtained using different fitting methods, see Fig. 3. In the fit of $P(f)$ (frequency) we obtain
$\alpha_Z^{\dagger}$ and calculate $\alpha_Z = 1/(\alpha_Z^{\dagger} - 1)$. English versions of the books were obtained from Project Gutenberg,
see Appendix.

Book | Rank $f(r)$: $\hat{\alpha}_Z$ | Rank $f(r)$: p-value | Frequency $P(f)$: $\hat{\alpha}_Z$ | Frequency $P(f)$: p-value | Linear $\log f(\log r)$: $\hat{\alpha}_Z$ | Linear $\log f(\log r)$: correlation coefficient
Alice’s Adventures in Wonderland (L. Carroll) | 1.22 | 1 o V | 1.46 | < 10“^ | 1.21 | 0.97
The Voyage Of The Beagle (C. Darwin) | 1.20 | 1 o V | 1.59 | < 10“^ | 1.29 | 0.97
The Jungle (U. Sinclair) | 1.21 | | 1.45 | < 10“^ | 1.22 | 0.98
Life On The Mississippi (M. Twain) | 1.20 | 1 o V | 1.38 | < 10“^ | 1.16 | 0.98
Moby Dick; or The Whale (H. Melville) | 1.19 | 1 o V | 1.38 | < 10“^ | 1.15 | 0.98
Pride and Prejudice (J. Austen) | 1.21 | | 1.66 | < 10“^ | 1.35 | 0.98
Don Quixote (M. Cervantes) | 1.21 | < 10“‘‘ | 1.70 | < 10“^ | 1.38 | 0.98
The Adventures of Tom Sawyer (M. Twain) | 1.21 | 1 o V | 1.29 | < 10“^ | 1.12 | 0.98
Ulysses (J. Joyce) | 1.18 | < | 1.15 | < 10“^ | 1.03 | 0.97
War and Peace (L. Tolstoy) | 1.20 | < | 1.84 | < 10“^ | 1.44 | 0.97
English Wikipedia | 1.17 | < | 1.60 | < 10“^ | 1.58 | 0.99


across different databases. Large values of the correlation
coefficient computed in the linear fit, usually interpreted as an indication of
good fitting, are observed also when the p-values are very
low.

Correlated samples. The failure to pass significance
tests for increasing data size is not surprising because any
small deviation from the null model becomes statistically
significant. A possible conclusion emerging from these
analyses is that power-law distributions are not as widely
valid as previously claimed (see also Refs. [50, 59]), but
often are better than alternative (simple) descriptions
(see our previous publication Ref. [21] in which we consider
two-parameter generalizations of Zipf’s law). The
main criticism we have of this widely used framework
of analysis is that it ignores the presence of correlations
in the data: the computation of the likelihood in Eq. (6)

FIG. 4: Estimation of the frequency of a word in the first $n$ word tokens of a book (Moby Dick by H. Melville). The red curve
corresponds to the actual observation (word “water” in the left and word “whale” in the right) and the blue curve to the curve
measured in a version of the book in which all word tokens were randomly shuffled. The shaded regions show the expected
fluctuations ($\pm 2\sigma$) assuming that the probability of using the word is given by the frequency of the word in the whole book
and that: (i) usage is random (blue region) - see also Ref. [7] - or (ii) the time between successive usages of the word
is drawn randomly from a stretched exponential distribution with exponent $\beta = 0.5$, as proposed in Ref. EH.


assumes independent observations. Furthermore, this as¬ 
sumption leads to an underestimation of the expected 
fluctuations (e.g. KS-distance) in the calculation of the 
p-value when assessing the validity of the law. It is thus
unclear to what extent a negative result in the validity
test (e.g., p-value < 0.01) is due to a failure of the
proposed law or, instead, to the violation of the
hypothesis of independent sampling. This hypothesis is
known to be violated in texts; the words and letters in a
sequence are obviously related to each other. In
Fig. 4 we show that these correlations affect the estimation
of the frequency of individual words, which show
fluctuations much larger than those expected not only
based on the independent random usage of words (Poisson
or bag-of-words models) but also in a null model in
which burstiness is included [24, 25]. Altogether, this
shows that the independence assumption, used to write
the likelihood, is strongly violated and affects both
the analysis based on $f(r)$ (correlation throughout a
book) and $P(f)$ ($f$ can be thought of as a finite-size
estimate like the ones shown in Fig. 4).
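
The comparison between observed fluctuations and the independent-usage band can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "book" is a synthetic token list in which one word is frequent in every fifth chapter (a crude stand-in for burstiness, not the Moby Dick data), and the names `running_frequency` and `fraction_outside_band` are hypothetical. The band is the binomial $\pm 2\sigma$ around the whole-text frequency.

```python
import math
import random


def running_frequency(tokens, word):
    """Frequency estimate of `word` from the first n tokens, for n = 1..len(tokens)."""
    count, estimates = 0, []
    for n, token in enumerate(tokens, start=1):
        if token == word:
            count += 1
        estimates.append(count / n)
    return estimates


def fraction_outside_band(tokens, word, n_sigma=2.0):
    """Fraction of prefixes whose estimate leaves the independent-usage band."""
    estimates = running_frequency(tokens, word)
    F = estimates[-1]  # whole-text frequency, used as the usage probability
    outside = 0
    for n, f_n in enumerate(estimates, start=1):
        sigma = math.sqrt(F * (1.0 - F) / n)  # binomial standard deviation of f_n
        if abs(f_n - F) > n_sigma * sigma:
            outside += 1
    return outside / len(estimates)


# Synthetic stand-in for a book (assumption): "whale" is frequent
# in every fifth chapter and rare elsewhere.
rng = random.Random(0)
book = []
for chapter in range(50):
    p = 0.05 if chapter % 5 == 0 else 0.002
    book += ["whale" if rng.random() < p else "other" for _ in range(1000)]

# Shuffling destroys the temporal structure but keeps the word counts.
shuffled = book[:]
rng.shuffle(shuffled)

bursty_fraction = fraction_outside_band(book, "whale")
shuffled_fraction = fraction_outside_band(shuffled, "whale")
print(bursty_fraction, shuffled_fraction)
```

In such a toy text the ordered sequence is expected to leave the band far more often than its shuffled version, which is the qualitative effect reported in Fig. 4.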

One approach to take correlations into account is to
estimate a time after which two observations are independent,
and then consider observations only after this time
(a smaller effective sample size). Alternative approaches
considered statistical tests for specific classes of stochastic
processes (correlated in time) [60] or based on
estimations of the correlation coming from the data.
The application of these methods to linguistic laws is
not straightforward because these methods fail in cases
in which no characteristic correlation time exists. Books
show such long-range correlations, also in the position
of individual words [26, 28], in agreement
with the observations reported in Fig. 4. More generally,
correlations lead to a slower convergence to asymptotic
values and it is thus possible to create processes of
text generation that comply with a linguistic law asymptotically
but that (in finite samples) violate statistical tests
based on independent sampling. The problem also affects
model comparison and fitting because these procedures are
also based on the likelihood (in these cases, correlations
affect all models and therefore it is unclear to what extent
they impact the choice of the best model).
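
The first approach mentioned above (keeping only observations separated by a decorrelation time) can be sketched as follows. This is a minimal illustration, not a method taken from the references: the threshold of 0.05, the maximum lag, and the function names are assumptions, and the test sequence is a short-memory two-state Markov chain for which a characteristic time does exist. For long-range correlated sequences such as books, `decorrelation_lag` may return `None`, which is precisely the failure mode discussed in the text.

```python
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0.0:
        return 0.0
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    return cov / var


def decorrelation_lag(x, threshold=0.05, max_lag=200):
    """Smallest lag whose |autocorrelation| is below threshold, else None."""
    for lag in range(1, max_lag + 1):
        if abs(autocorrelation(x, lag)) < threshold:
            return lag
    return None


# Toy indicator sequence (assumption): a two-state Markov chain that
# keeps its current state with probability 0.9 at each step.
rng = random.Random(1)
state, x = 0, []
for _ in range(20000):
    if rng.random() > 0.9:
        state = 1 - state
    x.append(state)

lag = decorrelation_lag(x)
# Keep one observation per decorrelation time: a smaller effective sample.
thinned = x[::lag] if lag else x
print(lag, len(x), len(thinned))
```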


V. RELATION BETWEEN LAWS 

In view of the different laws proposed to describe text 
properties, a natural question is the relationship between 
them (e.g., whether one law can be derived from another
or whether there are generative processes that account
for more than one law simultaneously). For instance,
Ref. [28] clarifies how the long-range correlation
of texts is related to the skewed distribution of recurrence
times between words (a consequence of
burstiness). Another well-known relation is the
connection between Heaps’ law and Zipf’s law [23, 39]
(see Refs. [23, 30, 62] for other examples).
Here again the importance of fluctuations and of an underlying
null model is often neglected.

The need for a null model is evident if we consider a 
text in which all possible words appear once in the very 
beginning of the text, violating Heaps’ law, even though 
their frequency over the full text is still compatible with 
Zipf’s law. A typical null model is to consider that ev¬ 
ery word is used independently from the others with a 
probability equal to its global frequency. This probabil¬ 
ity is usually taken to be constant throughout the text 
(Poisson process), but alternative formulations consid¬ 
ering time-dependent frequencies lead to similar results. 
For this generative model, Zipf’s law leads to a Heaps’
law with parameters $\alpha_H = 1/\alpha_Z$. Similar null
models are implicitly or explicitly assumed in different
derivations [23, 39].
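
The null-model calculation can be checked numerically. The sketch below assumes a pure Zipf rank-frequency distribution $f(r) \propto r^{-\alpha_Z}$ over a finite number of ranks and uses the fact that, under independent Poisson usage, a word of frequency $f$ has appeared at least once after $N$ tokens with probability $1 - e^{-Nf}$. The value $\alpha_Z = 1.5$, the number of ranks, and the function names are illustrative assumptions, not values taken from the data.

```python
import math


def zipf_frequencies(alpha_z, n_ranks):
    """Normalized Zipf frequencies f(r) proportional to r**(-alpha_z), r = 1..n_ranks."""
    weights = [r ** (-alpha_z) for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_vocabulary(freqs, n_tokens):
    """Expected number of distinct words after n_tokens under independent Poisson usage."""
    return sum(1.0 - math.exp(-n_tokens * f) for f in freqs)


alpha_z = 1.5  # assumed Zipf exponent (larger than 1 so a scaling regime exists)
freqs = zipf_frequencies(alpha_z, 200_000)

v_small = expected_vocabulary(freqs, 1_000)
v_large = expected_vocabulary(freqs, 10_000)

# Local slope of log V versus log N, to be compared with 1 / alpha_z.
heaps_exponent = math.log(v_large / v_small) / math.log(10_000 / 1_000)
print(heaps_exponent, 1.0 / alpha_z)
```

The estimated exponent is only approximate, because the finite number of ranks and the discreteness of the lowest ranks introduce small corrections to the asymptotic relation.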

Figure 5 shows that the connection between Zipf’s and
Heaps’ law using the independent usage of words fails to

FIG. 5: Relation between Zipf’s law and Heaps’ law in the English Wikipedia. Fixing the rank-frequency distribution of
the complete English Wikipedia, shown in panel (a), and assuming each word to follow a Poisson process (i.e., to be used
randomly) with fixed frequency $f(r)$, one obtains the blue curve for the Heaps’ law in (b). Considering each Wikipedia article
separately, as shown by black dots in (b), we estimate in a moving window centered at $N$ the average $\mu_V(N)$ and standard
deviation $\sigma_V(N)$ over all articles in the window. The dependence of $\mu_V(N)$ on $N$ is shown in (b) by a solid line. The dependence
of $\sigma_V(N)$ on $\mu_V(N)$ is shown in (c) and reveals a different scaling than the one predicted by the Poisson model. Figure adapted
from Ref. [23].


reproduce the fluctuations observed in data. In particular,
the fluctuations around the average vocabulary size
$V$ predicted from Heaps’ law scale linearly with $N$, and
not as $\sqrt{N}$ as predicted by the independence assumption
(through the central limit theorem). In Ref. [23] we have
shown that this scaling, also known as Taylor’s law [63],
is a result of correlations in the usage among different
words induced by the existence of topical structures
inside and across books.
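
The bound implied by the independence assumption can be made explicit with a short calculation. The indicator variables $I_r$ and the probabilities $p_r(N)$ are notation introduced here for illustration; the argument only uses the Poisson null model described above.

```latex
% Null model: word r appears at least once in N tokens with probability p_r(N),
% independently of all other words.
p_r(N) = 1 - e^{-N f(r)}, \qquad
V(N) = \sum_r I_r, \quad I_r \sim \mathrm{Bernoulli}\bigl(p_r(N)\bigr)

% Mean and variance of the vocabulary size:
\mu_V(N) = \sum_r p_r(N), \qquad
\sigma_V^2(N) = \sum_r p_r(N)\,\bigl[1 - p_r(N)\bigr] \;\le\; \mu_V(N)

% Hence, under independence,
\sigma_V(N) \le \sqrt{\mu_V(N)} \le \sqrt{N},
% whereas Taylor's law with linear scaling reads
\sigma_V(N) \propto \mu_V(N).
```

Fluctuations that grow proportionally to the mean therefore cannot be produced by this null model, whatever the rank-frequency distribution $f(r)$.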


VI. DISCUSSION 

It is common to find claims that a particular linguistic
law is valid in a language or corpus. A closer inspection
of the statistical support for these claims is often
disappointing. In this Chapter we performed a critical
discussion of linguistic laws, the sense in which they can
be considered valid, and the extent to which the evidence
supports their validity. We argued that linguistic laws have
to be interpreted in a statistical sense. Therefore, model
selection (also fitting) and the compatibility with data have
to be assessed using statistical tests based on the
likelihood (plausibility) of the observations. The statistical
analysis is far from being free of choices, both in terms
of the methods employed and of additional assumptions
not contained in the original law, as discussed
below. The analysis we presented above is intended to
show that these choices matter and should be carefully
discussed. The picture that emerges from the straightforward
application of the statistical tests above is that: (i) the
linguistic laws are often the best simple description of
the data, but (ii) the data are not generated according to
them, so that in a strict sense the validity of the laws is falsified.
This interpretation suggests that linguistic laws
are useful and capture some of the ingredients seen in
language, but are unable to describe observations in full
detail even in the limit of large texts (possibly because of
the existence of additional processes ignored by the law).

The main limitation of the methods we described, and 
thus of the conclusions summarized above, is that they 
were based not only on the statement of the law but also 
on the hypothesis that observations are independent and 
identically distributed. This hypothesis is known to be 
violated in almost all observations of written language. 
It is thus unclear to what extent the rejection of the null
model (small p-value) can be considered a falsification of 
the linguistic law. On the one hand, this reasoning shows 
the limitation of the statistical methods and the neces¬ 
sity to apply and develop tests able to deal with (long- 
range) correlated data. On the other hand, it shows that 
the usual statements of linguistic laws are incomplete be¬ 
cause they cannot be properly tested. A meaningful for¬ 
mulation of a linguistic law allows for the computation 
of the likelihood of the observations, e.g., it should be 
accompanied by a prediction of the fluctuations, a gen¬ 
erative model for the relevant variables, or, ultimately, 
a model for the generation of texts. Such models are
usually interpreted as an explanation of the origin of the
laws and are absent from the statement of
the linguistic laws, despite the fact that Herdan already
drew attention to this point [1]: “The quantities which
we call statistical laws being only expectations, they are
subject to random fluctuations whose extent must be regarded
as part of the statistical law.” In the same sense
that a scientific law cannot be judged separately from a
theory, linguistic laws are only fully defined once a generative
process is given. The existence of long-range correlations,
burstiness, and topical variations leads to strong
fluctuations in the estimations of observables in texts,
including the quantities described by linguistic laws.

Our findings have consequences for applications in information
retrieval and text generation. For instance,
our results show that strong fluctuations around specific
laws are observed and that results obtained using the independence
assumption (e.g., bag-of-words models) have
a limited applicability. Therefore, statistical laws should
not be imposed too strictly in the generation of artificial
texts or in the analysis of unknown databases. Large
fluctuations are as much a characteristic of language as
the laws themselves and therefore the creativity in the
generation of texts is much larger than the one obtained
if laws are imposed as strict constraints.

Finally, we would like to mention that our conclusions
also apply to other statistical laws beyond linguistics. Invariably,
the increase of data size leads to a rejection
of null models, e.g., many recent works emphasize that
claims of power-law distributions do not survive rigorous
statistical tests. However, the statistical tests
employed in these works, and in most likelihood-based
analyses, rely on the independence assumption of
the observations (known to be violated in many of the
treated cases). Nevertheless, we are not aware that this
point has been critically discussed in the large number
of publications on power-law fitting. The crucial role of
mechanistic models in the fitting and statistical analysis
of scaling laws was emphasized in Ref. [64] for urban-economic
data.

Acknowledgments: We thank Roger Guimera, 
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cussions. 

Appendix 

The books listed in Tab. III were obtained from Project
Gutenberg (http://www.gutenberg.org). The books
and data filtering are the same as the ones used in
Ref. [28] (see the Supplementary Information of that paper
for further details). We removed capitalization and
all symbols except the letters “a-z”, the numbers “0-9”,
the apostrophe, and the blank space. A string of symbols
between two consecutive blank spaces was considered to
be a word.

The English Wikipedia data was obtained from Wiki¬ 
media dumps (http://dumps.wikimedia.org/). The 
filtering was the same as the one used in Ref. [23], in 
which we removed capitalization and kept only those 
words (i.e. sequences of symbols separated by blank 
space) which consisted exclusively of the letters “a-z” and 
the apostrophe. 
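
The two filtering rules described above can be sketched in Python. This is an illustration of the stated rules, not the code used for the paper: the function names are hypothetical, and treating every whitespace character (including line breaks) as a blank space is an assumption.

```python
import re


def tokenize_book(text):
    """Book filter: lowercase, keep only a-z, 0-9, apostrophe and blank space."""
    # Assumption: any run of whitespace counts as a single blank space.
    text = re.sub(r"\s+", " ", text.lower())
    # Remove every symbol outside the allowed set (this can merge neighbors,
    # e.g. words joined by a dash become a single string).
    text = re.sub(r"[^a-z0-9' ]", "", text)
    return text.split()


def tokenize_wikipedia(text):
    """Wikipedia filter: lowercase, split on blank space, keep only a-z/apostrophe words."""
    words = re.sub(r"\s+", " ", text.lower()).split()
    return [w for w in words if re.fullmatch(r"[a-z']+", w)]


book_tokens = tokenize_book("Call me Ishmael. It's 1851!")
wiki_tokens = tokenize_wikipedia("It's 1851, they said.")
print(book_tokens)
print(wiki_tokens)
```

Note the difference between the two rules as literally stated: the book filter deletes unwanted symbols inside a string, whereas the Wikipedia filter discards any string that contains one, so a word followed directly by punctuation is dropped in the second case.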

The computation of the Menzerath-Altmann law appearing
in the corresponding figures and table was done starting from the
unique words (word types) in the databases discussed in
the previous paragraphs. For each word w we applied
the following steps:

1. Lemmatize using the WordNetLemmatizer (http://wordnet.princeton.edu)
in the NLTK Python package (http://www.nltk.org/).

2. Count the number of syllables based on the
Moby Hyphenation List by Grady Ward, available
at http://www.gutenberg.org/ebooks/3204

3. Count the number of phonemes based on
The CMU Pronouncing Dictionary, version 0.7b,
available at www.speech.cs.cmu.edu/cgi-bin/cmudict

For the book Moby Dick by H. Melville, this procedure
allowed us to compute the numbers of syllables and phonemes for 11,595 words, 66% of
the total number of words (before lemmatization). For
the Wikipedia, we obtain 60,749 words, 1.7% of the total
number. The low success in Wikipedia is due to the size
of the database (large number of rare words) and the
results depend more strongly on the procedure described
above than on the database itself.
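
The three steps and the resulting coverage can be sketched as a small pipeline. To keep the example self-contained, the lemmatizer and the two dictionaries are replaced by hypothetical toy stand-ins (`toy_lemmas`, `toy_syllables`, `toy_phonemes`) whose entries are illustrative only; in practice they would be supplied by NLTK, the Moby Hyphenation List, and the CMU Pronouncing Dictionary. Counting coverage over word types before lemmatization is an assumption about how the percentages above were computed.

```python
def measure_word_types(word_types, lemmatize, syllable_counts, phoneme_counts):
    """Return (syllables, phonemes) per word type and the fraction of types covered."""
    measured = {}
    for word in word_types:
        lemma = lemmatize(word)  # step 1: lemmatize
        # Steps 2 and 3: a word is kept only if both lookups succeed.
        if lemma in syllable_counts and lemma in phoneme_counts:
            measured[word] = (syllable_counts[lemma], phoneme_counts[lemma])
    coverage = len(measured) / len(word_types) if word_types else 0.0
    return measured, coverage


# Hypothetical toy resources standing in for the real lemmatizer and dictionaries.
toy_lemmas = {"whales": "whale"}
toy_syllables = {"whale": 1, "water": 2}
toy_phonemes = {"whale": 3, "water": 4}

measured, coverage = measure_word_types(
    ["whales", "water", "xyzzy"],
    lambda w: toy_lemmas.get(w, w),
    toy_syllables,
    toy_phonemes,
)
print(measured, coverage)
```

Rare strings such as the made-up "xyzzy" are absent from the dictionaries and lower the coverage, which is the mechanism behind the low success rate reported for Wikipedia.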
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